Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning
Authors
Abstract
Since most real-world applications of classification learning involve continuous-valued attributes, properly addressing the discretization process is an important problem. This paper addresses the use of the entropy minimization heuristic for discretizing the range of a continuous-valued attribute into multiple intervals. We briefly present theoretical evidence for the appropriateness of this heuristic for use in the binary discretization algorithm used in ID3, C4, CART, and other learning algorithms. The results serve to justify extending the algorithm to derive multiple intervals. We formally derive a criterion based on the minimum description length principle for deciding the partitioning of intervals. We demonstrate via empirical evaluation on several real-world data sets that better decision trees are obtained using the new multi-interval algorithm.

Introduction

Classification learning algorithms typically use heuristics to guide their search through the large space of possible relations between combinations of attribute values and classes. One such heuristic uses the notion of selecting attributes that locally minimize the information entropy of the classes in a data set (cf. the ID3 algorithm (13) and its extensions, e.g. GID3 (2), GID3* (5), and C4 (15), as well as CART (1), CN2 (3), and others). See (11; 5; 6) for a general discussion of the attribute selection problem.

The attributes in a learning problem may be nominal (categorical), or they may be continuous (numerical). The term "continuous" is used in the literature to refer to attributes taking on numerical values (integer or real), or in general to an attribute with a linearly ordered range of values. The above-mentioned attribute selection process assumes that all attributes are nominal. Continuous-valued attributes are therefore discretized prior to selection, typically by partitioning the range of the attribute into subranges. In general, a discretization is simply a logical condition, in terms of one or more attributes, that serves to partition the data into at least two subsets. In this paper, we focus only on the discretization of continuous-valued attributes.

We first present a result about the information entropy minimization heuristic for binary discretization (two-interval splits). This gives us:

- a better understanding of the heuristic and its behavior,
- formal evidence that supports the usage of the heuristic in this context, and
- a gain in computational efficiency that speeds up the evaluation process for continuous-valued attribute discretization.

We then proceed to extend the algorithm to divide the range of a continuous-valued attribute into multiple intervals rather than just two. We first motivate the need for such a capability, then present the multiple-interval generalization, and finally present the empirical evaluation results confirming that the new capability does indeed produce better decision trees.

Binary Discretization

A continuous-valued attribute is typically discretized during decision tree generation by partitioning its range into two intervals. A threshold value for the continuous-valued attribute is determined, and the test ...
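The minimum-description-length criterion mentioned in the abstract can be stated precisely. The following is the decision inequality as derived in the body of the paper, where S is a set of N examples, T is a candidate cut point for attribute A, S1 and S2 are the two subsets induced by the cut, k, k1, and k2 count the classes present in S, S1, and S2, and Ent denotes the class information entropy:

```latex
% MDLP stopping criterion: a binary cut point T on a set S of N examples
% is accepted iff the information gain clears a description-length penalty.
\[
  \mathrm{Gain}(A,T;S) \;>\; \frac{\log_2(N-1)}{N} \;+\; \frac{\Delta(A,T;S)}{N},
\]
\[
  \Delta(A,T;S) \;=\; \log_2\!\left(3^{k}-2\right)
  \;-\; \bigl[\,k\,\mathrm{Ent}(S) - k_1\,\mathrm{Ent}(S_1) - k_2\,\mathrm{Ent}(S_2)\,\bigr].
\]
```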
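To make the procedure concrete, here is a minimal Python sketch, an illustration rather than the authors' implementation, of the two pieces the paper combines: the entropy-minimizing binary cut applied recursively, with the MDLP rule above deciding whether each cut is kept. The names (`entropy`, `mdlp_accepts`, `discretize`) are ours, and for brevity the sketch evaluates every cut between distinct adjacent values instead of only the boundary points that the paper's efficiency result would permit.

```python
import math
from collections import Counter

def entropy(labels):
    """Class information entropy: Ent(S) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdlp_accepts(labels, left, right):
    """MDLP stopping rule: accept the cut iff the information gain
    exceeds (log2(N-1) + Delta) / N, per the criterion stated above."""
    n = len(labels)
    ent_s, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)
    gain = ent_s - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * ent_l - k2 * ent_r)
    return gain > (math.log2(n - 1) + delta) / n

def discretize(values, labels):
    """Recursive entropy-based multi-interval discretization.
    Returns the accepted cut points (midpoints between adjacent values)."""
    pairs = sorted(zip(values, labels))
    vals = [v for v, _ in pairs]
    labs = [c for _, c in pairs]
    cuts = []

    def best_cut(lo, hi):
        # Keep the cut between distinct adjacent values that minimizes the
        # weighted class entropy of the two induced subsets. (The paper shows
        # only "boundary points" need be checked; this sketch skips that
        # optimization.)
        best, n = None, hi - lo
        for i in range(lo + 1, hi):
            if vals[i] == vals[i - 1]:
                continue
            left, right = labs[lo:i], labs[i:hi]
            e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
            if best is None or e < best[0]:
                best = (e, i)
        return best

    def recurse(lo, hi):
        if hi - lo < 2:
            return
        found = best_cut(lo, hi)
        if found is None:
            return
        _, i = found
        if mdlp_accepts(labs[lo:hi], labs[lo:i], labs[i:hi]):
            cuts.append((vals[i - 1] + vals[i]) / 2)
            recurse(lo, i)  # keep splitting each side until MDLP rejects
            recurse(i, hi)

    recurse(0, len(labs))
    return sorted(cuts)

# Toy usage: class 'a' below the gap, class 'b' above; one cut is accepted,
# and the pure subintervals are rejected by the MDLP rule.
values = list(range(1, 16)) + list(range(20, 35))
labels = ['a'] * 15 + ['b'] * 15
print(discretize(values, labels))  # -> [17.5]
```

Note how the criterion yields multiple intervals for free: each accepted cut recursively exposes two subranges, and the MDLP test supplies the stopping condition that a plain two-interval split lacks.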
Similar Resources
Discretization of Continuous-valued Attributes and Instance-based Learning
Recent work on discretization of continuous-valued attributes in learning decision trees has produced some positive results. This paper adopts the idea of discretization of continuous-valued attributes and applies it to instance-based learning (Aha, 1990; Aha, Kibler & Albert, 1991). Our experiments have shown that instance-based learning (IBL) usually performs well in continuous-valued attribu...
A New Algorithm for Optimization of Fuzzy Decision Tree in Data Mining
Decision-tree algorithms provide one of the most popular methodologies for symbolic knowledge acquisition. The resulting knowledge, a symbolic decision tree along with a simple inference mechanism, has been praised for comprehensibility. The most comprehensible decision trees have been designed for perfect symbolic data. Classical crisp decision trees (DT) are widely applied to classification t...
Value Difference Metrics for Continuously Valued Attributes
Nearest neighbor and instance-based learning techniques typically handle continuous and linear input values well, but often do not handle symbolic input attributes appropriately. The Value Difference Metric (VDM) was designed to find reasonable distance values between symbolic attribute values, but it largely ignores continuous attributes, using discretization to map continuous values into symb...
Multi-interval Discretization of Continuous Attributes for Label Ranking
Label Ranking (LR) problems, such as predicting rankings of financial analysts, are becoming increasingly important in data mining. While there has been a significant amount of work on the development of learning algorithms for LR in recent years, preprocessing methods for LR are still very scarce. However, some methods, like Naive Bayes for LR and APRIORI-LR, cannot deal with real-valued data ...
Solving robot selection problem by a new interval-valued hesitant fuzzy multi-attributes group decision method
Selecting the most suitable robot, given the wide range of robot specifications and capabilities, is an important issue when performing hazardous and repetitive jobs. Companies should consider powerful group decision-making (GDM) methods to evaluate candidate robots against the selected attributes (criteria). In this study, a new GDM method is proposed by utilizi...
Journal:
Volume/Issue:
Pages: -
Published: 1993